Symbols

Here are some basic symbols that you will see in R:

<-

An assignment operator

x <- 1:5

=

An alternative assignment operator (not preferred by the purists)

y = 6:10

()

() always denote a function call – for example mean()

mean(x)
[1] 3

{}

{} enclose the body of a function (or, more generally, a block of code)

function(x) {
  x + 5
}
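Assign the function to a name and it can be called like any built-in (the name addFive here is just for illustration):

```r
# A small user-defined function; {} wrap its body
addFive = function(x) {
  x + 5
}

addFive(10)
# [1] 15
```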

[]

[] are used for indexing

# Gets the second thing
x[2] 
[1] 2
# Gets the first row
cars[1, ] 
  speed dist
1     4    2
# Gets the first column
cars[, 1] 
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25
# Gets the value in the first row
# in the first column
cars[1, 1] 
[1] 4
# Gets the first 10 rows of the "dist" column
cars[1:10, "dist"]
 [1]  2 10  4 22 16 10 18 26 34 17
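Two more indexing tricks worth knowing: negative indices drop positions, and logical vectors select wherever they are TRUE.

```r
x <- 1:5

# A negative index drops that position
x[-2]
# [1] 1 3 4 5

# A logical condition keeps only matching elements
x[x > 3]
# [1] 4 5
```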

$

$ is a shortcut used for accessing names within a data.frame

cars$speed
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25

It is equivalent to:

cars[, "speed"]
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25

"" and ''

"" and '' are both used to specify strings

z = c("string1", 'string2', "string3")

``

`` are used for making something literal

cars$`1 $ bad name` = 1:nrow(cars)

head(cars)
  speed dist 1 $ bad name
1     4    2            1
2     4   10            2
3     7    4            3
4     7   22            4
5     8   16            5
6     9   10            6

==

== tests whether two values are equal (not to be confused with the single = used for assignment)

cars[cars$speed == 20, ]
   speed dist 1 $ bad name
39    20   32           39
40    20   48           40
41    20   52           41
42    20   56           42
43    20   64           43

!=

!= is not equal to

cars[cars$speed != 20, ]
   speed dist 1 $ bad name
1      4    2            1
2      4   10            2
3      7    4            3
4      7   22            4
5      8   16            5
6      9   10            6
7     10   18            7
8     10   26            8
9     10   34            9
10    11   17           10
11    11   28           11
12    12   14           12
13    12   20           13
14    12   24           14
15    12   28           15
16    13   26           16
17    13   34           17
18    13   34           18
19    13   46           19
20    14   26           20
21    14   36           21
22    14   60           22
23    14   80           23
24    15   20           24
25    15   26           25
26    15   54           26
27    16   32           27
28    16   40           28
29    17   32           29
30    17   40           30
31    17   50           31
32    18   42           32
33    18   56           33
34    18   76           34
35    18   84           35
36    19   36           36
37    19   46           37
38    19   68           38
44    22   66           44
45    23   54           45
46    24   70           46
47    24   92           47
48    24   93           48
49    24  120           49
50    25   85           50

|

| is the or operator

cars[cars$speed == 18 | cars$speed == 20, ]
   speed dist 1 $ bad name
32    18   42           32
33    18   56           33
34    18   76           34
35    18   84           35
39    20   32           39
40    20   48           40
41    20   52           41
42    20   56           42
43    20   64           43

&

& is the and operator

cars[cars$speed == 18 & cars$dist == 56, ]
   speed dist 1 $ bad name
33    18   56           33

%*%

%*% is matrix multiplication

x %*% y
     [,1]
[1,]  130

%%

%% is the modulus

10 %% 3
[1] 1
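Its companion %/% is integer division; together the two split a number into whole multiples plus a remainder.

```r
10 %/% 3   # integer division: 3 fits three whole times
# [1] 3

10 %% 3    # modulus: what is left over
# [1] 1

# The two always reassemble the original: 3 * 3 + 1 == 10
(10 %/% 3) * 3 + (10 %% 3)
# [1] 10
```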

%>%

%>% is a pipe that passes the output of one function into another function:

library(dplyr)

1:10 %>% 
  mean()
[1] 5.5

See the magrittr package for more potential pipes.

Finding Packages

CRAN Task Views will help you find packages that do what you want to do. Packages are arranged into broad categories that are then broken down into finer topics.

There is also rseek.

Package Installation

Install new packages from CRAN with:

install.packages("tidyverse")

If you want to install multiple packages at once, you need to pass a character vector:

install.packages(c("tidyverse", "psych", "Hmisc"))

If you want to install something from Github:

devtools::install_github("tidyverse/ggplot2")

Loading Packages

If you want to use a package, you load it into your local environment with:

library(dplyr)

Data Types

There are a few different types of data within R.

We have numeric variables:

nums = 1:10

nums
 [1]  1  2  3  4  5  6  7  8  9 10

We have characters:

chars = c("Poor", "Fair", "Good", "Great")

chars
[1] "Poor"  "Fair"  "Good"  "Great"

And we have factors:

facs = as.factor(chars)

facs
[1] Poor  Fair  Good  Great
Levels: Fair Good Great Poor

We can have just regular factors like above or we can order those factors:

orderedFacs = ordered(facs, 
                      levels = c("Poor", "Fair", "Good", "Great"))

orderedFacs
[1] Poor  Fair  Good  Great
Levels: Poor < Fair < Good < Great
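The payoff of ordering is that inequality comparisons become meaningful (ratings below is a hypothetical vector for illustration):

```r
# A small ratings vector, ordered worst to best
ratings = ordered(c("Poor", "Fair", "Good", "Great"),
                  levels = c("Poor", "Fair", "Good", "Great"))

# Which entries rank above "Fair"?
ratings > "Fair"
# [1] FALSE FALSE  TRUE  TRUE
```

The same comparison on an unordered factor returns NA with a warning.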

Data Structures

At our most basic, we have a vector. Think of it as just a basic sequence of values:

vectorExample = rnorm(10)

vectorExample
 [1] -0.24579666 -0.04508388  1.59762042  0.05789691 -0.01449319
 [6] -0.15492893 -1.67147798  1.13017519 -0.48050509  0.24531824

Next, there is the matrix. A matrix is a rectangle where everything is the same type of data:

characterMatrix = matrix(letters, nrow = 5, ncol = 5)

characterMatrix
     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "f"  "k"  "p"  "u" 
[2,] "b"  "g"  "l"  "q"  "v" 
[3,] "c"  "h"  "m"  "r"  "w" 
[4,] "d"  "i"  "n"  "s"  "x" 
[5,] "e"  "j"  "o"  "t"  "y" 
numericMatrix = matrix(1:25, nrow = 5, ncol = 5)

numericMatrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
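Because a matrix holds one type only, mixing types silently coerces everything to the most general type (here, character; mixedMatrix is just an illustrative name):

```r
# Numbers and strings mixed together: everything becomes character
mixedMatrix = matrix(c(1, "a", 2, "b"), nrow = 2)

mixedMatrix
#      [,1] [,2]
# [1,] "1"  "2" 
# [2,] "a"  "b" 

class(mixedMatrix[1, 1])
# [1] "character"
```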

If we have data of different types, we are dealing with data frames:

dfExample = data.frame(numVar = 1:3, 
                       charVar = letters[1:3], 
                       facVar = factor(c("Poor", "Good", "Great")), 
                       ordVar = ordered(c("Poor", "Good", "Great"), 
                                        levels = c("Poor", "Good", "Great")))

summary(dfExample)
     numVar    charVar   facVar    ordVar 
 Min.   :1.0   a:1     Good :1   Poor :1  
 1st Qu.:1.5   b:1     Great:1   Good :1  
 Median :2.0   c:1     Poor :1   Great:1  
 Mean   :2.0                              
 3rd Qu.:2.5                              
 Max.   :3.0                              

We also have lists. Lists can contain any number of objects, of any type, within the different list entries:

listExample = list(1, rnorm(10), numericMatrix, rnorm(30), dfExample)

listExample
[[1]]
[1] 1

[[2]]
 [1] -0.60319048 -0.55735485  0.16231938  0.39426799 -1.57423818
 [6]  2.35424746 -0.90826017  0.06369677  1.31953776  0.11824409

[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

[[4]]
 [1]  0.21249978  0.06143675  0.47016997 -1.04281371  0.26751743
 [6] -1.28815480  1.84645931 -1.05787905 -1.19339050  0.08134385
[11]  0.47368117 -0.83474221  1.35581464  0.26824295  0.34538783
[16]  1.79635602  0.82699605  1.23727767  2.14122136 -1.08364015
[21] -0.84026766 -0.87357571 -1.21470169 -0.75991027  1.37604760
[26] -0.41857016  0.30064221  1.68972111 -1.54151876  0.84000654

[[5]]
  numVar charVar facVar ordVar
1      1       a   Poor   Poor
2      2       b   Good   Good
3      3       c  Great  Great
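Getting things back out of a list: single brackets return a smaller list, while double brackets (or $ for named entries) return the element itself (namedList below is a toy example):

```r
namedList = list(first = 1, second = c("a", "b"))

namedList[1]      # still a list, of length 1
namedList[[1]]    # the element itself
# [1] 1

namedList$second  # same as namedList[["second"]]
# [1] "a" "b"
```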

Useful Functions

There are many functions that will be helpful, but here are a few easy ones:

sum(1:10, na.rm = TRUE)
[1] 55
mean(1:10, na.rm = TRUE)
[1] 5.5
sd(1:10, na.rm = TRUE)
[1] 3.02765
paste("tic", "tac", "toe", sep = "-")
[1] "tic-tac-toe"
state.abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN"
[15] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV"
[29] "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[43] "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
match(c("January", "February"), month.name)
[1] 1 2
testWords = c("Bad", "Poor", "Great", "Awesome")

dfExample[dfExample$facVar %in% testWords, ]
  numVar charVar facVar ordVar
1      1       a   Poor   Poor
3      3       c  Great  Great
dfExample[!(dfExample$facVar %in% testWords), ]
  numVar charVar facVar ordVar
2      2       b   Good   Good

The Tidyverse

* __  _    __   .    o           *  . 
 / /_(_)__/ /_ ___  _____ _______ ___ 
/ __/ / _  / // / |/ / -_) __(_-</ -_)
\__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
     *  . /___/      o      .       * 

We are going to be working in the tidyverse for a good chunk of our time together. The whole point of the tidyverse is to offer a consistent grammar of verbs, and that grammar is going to help us in many of the situations we will encounter.

Another great feature of the tidyverse is the pipe: %>%

It does the same thing as the Unix |, but | in R is an or operator.
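To make the rewiring concrete, x %>% f(y) is just f(x, y), so a nest of calls unfolds into a left-to-right chain (a small sketch; %>% comes from magrittr, which dplyr re-exports):

```r
library(magrittr)

# Nested form: read inside-out
round(sqrt(sum(1:10)), 1)
# [1] 7.4

# Piped form: read left to right
1:10 %>% sum() %>% sqrt() %>% round(1)
# [1] 7.4
```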

With all of the glowing praise for the tidyverse, we are still going to see some base R. Sometimes, it will demonstrate great reasons for using the tidyverse; in other situations, it will help you to not be afraid of base R when the need arises.

Some Demonstrations

Summary Tables

library(ggplot2)

plotDat = aggregate(diamonds$cut, by = list(cut = diamonds$cut), 
                    FUN = length)

colnames(plotDat)[2] = "n"

plotDat
        cut     n
1      Fair  1610
2      Good  4906
3 Very Good 12082
4   Premium 13791
5     Ideal 21551

Visual

ggplot(plotDat, aes(x = cut, y = n)) +
  geom_point(aes(size = n)) +
  theme_minimal()

(Im)Proper Plotting

Look at help(mtcars) and check out the variables. Can you spot what is wrong with this plot?

ggplot(mtcars, aes(x = wt, y = mpg, color = am)) + 
  geom_point() +
  theme_minimal()

Proper Plotting

The plot below is likely better.

library(dplyr)

mtcars$amFactor = as.factor(mtcars$am) 

ggplot(mtcars, aes(x = wt, y = mpg, color = amFactor)) + 
  geom_point() +
  theme_minimal()

Pipes: Making Life Easier

Recall some of the things that we just saw:

plotDat = aggregate(diamonds$cut, by = list(cut = diamonds$cut), FUN = length)

colnames(plotDat)[2] = "n"

ggplot(plotDat, aes(x = cut, y = n)) +
  geom_point(aes(size = n)) +
  theme_minimal()

This is somewhat tricky code. We have to create a new object with the oft-muddy aggregate and reset a column name (by magic number in an index, no less).

This can be made much easier with dplyr:

diamonds %>% 
  group_by(cut) %>% 
  summarize(n = n()) %>% 
  ggplot(., aes(x = cut, y = n)) +
  geom_point(aes(size = n)) +
  theme_minimal()

It isn’t a reduction in lines, but it is certainly clearer and follows a more logical thought process. This is the whole point of the tidyverse (and dplyr specifically) – allowing you to write how you would explain the process.

As an added bonus, we don’t need to create a bunch of different objects to do something simple.

We can see that dplyr will also make the plot for am easier.

mtcars %>% 
  mutate(am = as.factor(am)) %>%  
  ggplot(., aes(x = wt, y = mpg, color = am)) + 
  geom_point() +
  theme_minimal()

On Code Golf

You will often notice that a dplyr chunk might take a few more lines to work through than base R alone – don’t consider this a bad thing. There will be many times in this course and in your potential work when you might think that you need to use as few lines as possible. Resist this temptation. Sometimes you need to break something up into many lines and create new objects – this ability is exactly why we use R!

Data Import

Importing data is often the easiest part (never too hard to import a nice .csv). Sometimes, though, we need some other strategies.

Delimited Files

Frequently, you will see nicely delimited text files that are not .csv files – these are often tab-delimited files, but they can take other forms.

read.table("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment", 
           header = TRUE, sep = "\t")

Is the same as:

read.delim("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment")

The read.table() function gives you added flexibility to specify many different parameters.

Examine the following file from SDC Platinum and read it in properly:

SDC Wackiness

How did you do?

Did you notice anything about these files? They are not really very big, but they might have taken a little bit of time to read in. People sometimes comment that R is too slow on the read side. If you find your files are not being read quickly enough, you can try a few alternatives: readr and data.table.

Try the following:

library(readr)

readrTest = read_delim("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment", 
                       delim = "\t")
library(data.table)

dtTest = fread("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment", 
               sep = "\t")

That SDC file that might have taken a few minutes will now take just a few seconds:

sdc = read_delim("https://www3.nd.edu/~sberry5/data/sdcTest.txt", 
                 delim = "^")

Pretty awesome, right?

While readr works wonderfully on the read and write side, data.table is great for wrangling data that is a bit on the big side and is altogether blazing fast. However, it does not shy away from confusing syntax and weird conventions. With that in mind, we won’t be using it in this class, but do keep it in the back of your mind.
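For a quick taste of that syntax (a minimal sketch assuming data.table is installed), data.table's dt[i, j, by] form filters, computes, and groups in a single call:

```r
library(data.table)

dt = as.data.table(mtcars)

# Mean weight of the cars over 20 mpg, grouped by cylinder count
dt[mpg > 20, .(meanWt = mean(wt)), by = cyl]
```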

At times, you will get data in some proprietary format. That is when you need to turn to other places.

Excel

Download the following Excel file: https://www3.nd.edu/~sberry5/data/excelTest.xlsx

readxl::read_excel(path = "")

What do we know about Excel workbooks? Check out the help on readxl and let me know our path forward.

SAS

haven::read_sas(data_file = "https://www3.nd.edu/~sberry5/data/wciklink_gvkey.sas7bdat")

Stata

haven::read_dta(file = "https://www3.nd.edu/~sberry5/data/stataExample.dta")

SPSS

We often see the -99 added as the missing value in SPSS (of course, there is no way that -99 would ever be an actual value, right?).

haven::read_spss(file = "https://www3.nd.edu/~sberry5/data/spssExample.sav", 
                 user_na = "-99")

HTML

Depending on your needs, reading an HTML table into R is almost too easy.

library(rvest)

cpi = read_html("http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/") %>% 
  html_table(fill = TRUE)

Things might get a bit tricky:

highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>% 
  html_table(fill = TRUE)

What is the return of this call?

rio

For many of these tasks, you can just use the rio package – you give it the file and it will do the rest!

rio::import("folder/file")

Nested Structures

JSON

Web-based graphics started getting popular not too long ago. Generally, stats people were not using them, but web developer-type folks were. They needed a structure that would work well for the web and interact with their JavaScript-based graphics – thus, JavaScript Object Notation (JSON) was born. You will see JSON come out of many web-based interfaces.

This is what JSON looks like.

There are a few JSON-reading packages in R, but jsonlite tends to work pretty well.

jsonTest = jsonlite::read_json(path = "https://www3.nd.edu/~sberry5/data/optionsDataBrief.json", 
                                simplifyVector = TRUE)

This is a very simple form of JSON. We are going to see a hairier version of this data in the coming days.

JSON Dangers

There is JSON and then there is JSON. You might find yourself some interesting data and want to bring it in, but an error happens and you have no idea why the read_json function is telling you that the file is not JSON.

Not all JSON is pure JSON! When that is the case, you will need to create pure JSON.

Look at this file: https://www3.nd.edu/~sberry5/data/reviews_Musical_Instruments_5.json

It looks like JSON, but…

jsonlite::validate("https://www3.nd.edu/~sberry5/data/reviews_Musical_Instruments_5.json")

If we want to read that in as true JSON, we need to do some work:

musicalInstruments = readLines("https://www3.nd.edu/~sberry5/data/reviews_Musical_Instruments_5.json")

musicalInstruments = paste(unlist(lapply(musicalInstruments, function(x) {
  paste(x, ",", sep = "")
})), collapse = "")

musicalInstruments = paste("[", musicalInstruments, "]", sep = "")

musicalInstruments = gsub("},]", "}]", musicalInstruments)
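To see the repair recipe pay off on a tiny, self-contained example (two records of one-object-per-line JSON, like the reviews file; rawLines and pureJSON are illustrative names):

```r
library(jsonlite)

# Two records, one JSON object per line -- not valid as a single document
rawLines = c('{"id": 1, "review": "Good"}',
             '{"id": 2, "review": "Great"}')

# Join with commas and wrap in [] to make a proper JSON array
pureJSON = paste("[", paste(rawLines, collapse = ","), "]", sep = "")

validate(pureJSON)
# [1] TRUE

fromJSON(pureJSON)
#   id review
# 1  1   Good
# 2  2  Great
```

jsonlite::stream_in() can also read this line-delimited (NDJSON) format directly, which avoids the string surgery entirely.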

Mass Reading

Everything we just learned is great, and you will use all of it in your data wrangling missions.

Fortunately (or unfortunately, depending on how you look at it), it is not the whole story – you will frequently be reading in many files of the same type.

If you have two files, you might be able to get away with brute force:

# DO NOT RUN:

myData1 = read.csv("test.csv")

myData2 = read.csv("test2.csv")

Would you want to do this for 5 files? What about 100? Or 1000? I will answer it for you: no!

The chunks below introduce some very important functions. We are going to see lapply again – it is important that you learn to love the apply family!

# DO NOT RUN:

allFiles = list.files(path = "", all.files = TRUE, full.names = TRUE, 
                      recursive = TRUE, include.dirs = FALSE)

allFilesRead = lapply(allFiles, function(x) read.csv(x, stringsAsFactors = FALSE))

allData = do.call("rbind", allFilesRead)
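To see what that final do.call() line does, try the same pattern on toy data frames (a runnable stand-in for the list of read-in files):

```r
# Stand-ins for two files that were just read in
allFilesRead = list(data.frame(id = 1:2, val = c("a", "b")),
                    data.frame(id = 3:4, val = c("c", "d")))

# do.call feeds every list element to rbind as a separate argument
allData = do.call("rbind", allFilesRead)

allData
#   id val
# 1  1   a
# 2  2   b
# 3  3   c
# 4  4   d
```

If dplyr is loaded, bind_rows(allFilesRead) does the same job and handles mismatched columns more gracefully.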

You can also use rio:

# DO NOT RUN:

rio::import_list("", rbind = TRUE)

The Grammar Of Data

One of the major aims of the tidyverse is to provide a clear and consistent grammar to data manipulation. This is helpful when diving deeper into the weeds.

Do you remember this?

highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>% 
  html_table(fill = TRUE)

What did we get out of this? It was a big list of data frames. If we are looking for only one thing and we know that it is the first thing, we have some options:

highest = highest[[1]]

This works well if you keep the whole object first and then pluck out what you want. If you want to do it all in one go, though, we have even more options:

highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>% 
  html_table(fill = TRUE) %>% 
  `[[`(1)

And now we see why R mystifies people. What is that bit of nonsense at the end? It is really just an indexing shortcut. Once you know how to use it, it is great; however, it will make you shake your head if you see it in the wild without knowing about it first.

This is where the benefit of tidyverse becomes clear.

highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>% 
  html_table(fill = TRUE) %>%
  magrittr::extract2(1)

Or…

highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>% 
  html_table(fill = TRUE) %>%
  purrr::pluck(1)

Both functions do the same thing under slightly different names, and it is crystal clear what they are doing.

Do be careful, though, because we can run into issues with function masking: pluck from purrr does something very different from pluck in dplyr.

Someone try it and tell me what happens!

Selecting

Base

There are many ways to select variables with base R:

mtcars[, c(1:5, 7:8)]

keepers = c("mpg", "cyl", "disp", "hp", "drat", "qsec", "vs")

mtcars[, keepers]

mtcars[, c("mpg", grep("^c", names(mtcars), value = TRUE))]

You can also drop variables:

mtcars[, -c(1:2)]
                     disp  hp drat    wt  qsec vs am gear carb amFactor
Mazda RX4           160.0 110 3.90 2.620 16.46  0  1    4    4        1
Mazda RX4 Wag       160.0 110 3.90 2.875 17.02  0  1    4    4        1
Datsun 710          108.0  93 3.85 2.320 18.61  1  1    4    1        1
Hornet 4 Drive      258.0 110 3.08 3.215 19.44  1  0    3    1        0
Hornet Sportabout   360.0 175 3.15 3.440 17.02  0  0    3    2        0
Valiant             225.0 105 2.76 3.460 20.22  1  0    3    1        0
Duster 360          360.0 245 3.21 3.570 15.84  0  0    3    4        0
Merc 240D           146.7  62 3.69 3.190 20.00  1  0    4    2        0
Merc 230            140.8  95 3.92 3.150 22.90  1  0    4    2        0
Merc 280            167.6 123 3.92 3.440 18.30  1  0    4    4        0
Merc 280C           167.6 123 3.92 3.440 18.90  1  0    4    4        0
Merc 450SE          275.8 180 3.07 4.070 17.40  0  0    3    3        0
Merc 450SL          275.8 180 3.07 3.730 17.60  0  0    3    3        0
Merc 450SLC         275.8 180 3.07 3.780 18.00  0  0    3    3        0
Cadillac Fleetwood  472.0 205 2.93 5.250 17.98  0  0    3    4        0
Lincoln Continental 460.0 215 3.00 5.424 17.82  0  0    3    4        0
Chrysler Imperial   440.0 230 3.23 5.345 17.42  0  0    3    4        0
Fiat 128             78.7  66 4.08 2.200 19.47  1  1    4    1        1
Honda Civic          75.7  52 4.93 1.615 18.52  1  1    4    2        1
Toyota Corolla       71.1  65 4.22 1.835 19.90  1  1    4    1        1
Toyota Corona       120.1  97 3.70 2.465 20.01  1  0    3    1        0
Dodge Challenger    318.0 150 2.76 3.520 16.87  0  0    3    2        0
AMC Javelin         304.0 150 3.15 3.435 17.30  0  0    3    2        0
Camaro Z28          350.0 245 3.73 3.840 15.41  0  0    3    4        0
Pontiac Firebird    400.0 175 3.08 3.845 17.05  0  0    3    2        0
Fiat X1-9            79.0  66 4.08 1.935 18.90  1  1    4    1        1
Porsche 914-2       120.3  91 4.43 2.140 16.70  0  1    5    2        1
Lotus Europa         95.1 113 3.77 1.513 16.90  1  1    5    2        1
Ford Pantera L      351.0 264 4.22 3.170 14.50  0  1    5    4        1
Ferrari Dino        145.0 175 3.62 2.770 15.50  0  1    5    6        1
Maserati Bora       301.0 335 3.54 3.570 14.60  0  1    5    8        1
Volvo 142E          121.0 109 4.11 2.780 18.60  1  1    4    2        1
dropVars = c("vs", "drat")

mtcars[, !(names(mtcars) %in% dropVars)]
                     mpg cyl  disp  hp    wt  qsec am gear carb amFactor
Mazda RX4           21.0   6 160.0 110 2.620 16.46  1    4    4        1
Mazda RX4 Wag       21.0   6 160.0 110 2.875 17.02  1    4    4        1
Datsun 710          22.8   4 108.0  93 2.320 18.61  1    4    1        1
Hornet 4 Drive      21.4   6 258.0 110 3.215 19.44  0    3    1        0
Hornet Sportabout   18.7   8 360.0 175 3.440 17.02  0    3    2        0
Valiant             18.1   6 225.0 105 3.460 20.22  0    3    1        0
Duster 360          14.3   8 360.0 245 3.570 15.84  0    3    4        0
Merc 240D           24.4   4 146.7  62 3.190 20.00  0    4    2        0
Merc 230            22.8   4 140.8  95 3.150 22.90  0    4    2        0
Merc 280            19.2   6 167.6 123 3.440 18.30  0    4    4        0
Merc 280C           17.8   6 167.6 123 3.440 18.90  0    4    4        0
Merc 450SE          16.4   8 275.8 180 4.070 17.40  0    3    3        0
Merc 450SL          17.3   8 275.8 180 3.730 17.60  0    3    3        0
Merc 450SLC         15.2   8 275.8 180 3.780 18.00  0    3    3        0
Cadillac Fleetwood  10.4   8 472.0 205 5.250 17.98  0    3    4        0
Lincoln Continental 10.4   8 460.0 215 5.424 17.82  0    3    4        0
Chrysler Imperial   14.7   8 440.0 230 5.345 17.42  0    3    4        0
Fiat 128            32.4   4  78.7  66 2.200 19.47  1    4    1        1
Honda Civic         30.4   4  75.7  52 1.615 18.52  1    4    2        1
Toyota Corolla      33.9   4  71.1  65 1.835 19.90  1    4    1        1
Toyota Corona       21.5   4 120.1  97 2.465 20.01  0    3    1        0
Dodge Challenger    15.5   8 318.0 150 3.520 16.87  0    3    2        0
AMC Javelin         15.2   8 304.0 150 3.435 17.30  0    3    2        0
Camaro Z28          13.3   8 350.0 245 3.840 15.41  0    3    4        0
Pontiac Firebird    19.2   8 400.0 175 3.845 17.05  0    3    2        0
Fiat X1-9           27.3   4  79.0  66 1.935 18.90  1    4    1        1
Porsche 914-2       26.0   4 120.3  91 2.140 16.70  1    5    2        1
Lotus Europa        30.4   4  95.1 113 1.513 16.90  1    5    2        1
Ford Pantera L      15.8   8 351.0 264 3.170 14.50  1    5    4        1
Ferrari Dino        19.7   6 145.0 175 2.770 15.50  1    5    6        1
Maserati Bora       15.0   8 301.0 335 3.570 14.60  1    5    8        1
Volvo 142E          21.4   4 121.0 109 2.780 18.60  1    4    2        1

Issues?

For starters, the magic numbers are a no-go. The keepers lines could work, but would be a pain if we had a lot of variables.

Let’s check out this wacky stuff, where we want all variables that start with “age” and the variables that likely represent questions (x1, x2, x3, …):

library(lavaan)

testData = HolzingerSwineford1939

names(testData)
 [1] "id"     "sex"    "ageyr"  "agemo"  "school" "grade"  "x1"    
 [8] "x2"     "x3"     "x4"     "x5"     "x6"     "x7"     "x8"    
[15] "x9"    
keepers = c(grep("^age", names(testData), value = TRUE), 
            paste("x", 1:9, sep = ""))

testData = testData[, keepers]

Not only do we have another regular expression, but we also have this paste line to create variable names. It seems like too much work to do something simple!

While not beautiful, these are perfectly valid ways to do this work. I have such sights to show you, but don’t forget about this stuff – you never know when you might need to use it.

dplyr

We have already seen a bit of dplyr, but we are going to dive right into some of the functions now.

In base R, we have to do some chanting to select our variables. With dplyr, we can just use select:

mtcars %>% 
  select(mpg, cyl, am)
                     mpg cyl am
Mazda RX4           21.0   6  1
Mazda RX4 Wag       21.0   6  1
Datsun 710          22.8   4  1
Hornet 4 Drive      21.4   6  0
Hornet Sportabout   18.7   8  0
Valiant             18.1   6  0
Duster 360          14.3   8  0
Merc 240D           24.4   4  0
Merc 230            22.8   4  0
Merc 280            19.2   6  0
Merc 280C           17.8   6  0
Merc 450SE          16.4   8  0
Merc 450SL          17.3   8  0
Merc 450SLC         15.2   8  0
Cadillac Fleetwood  10.4   8  0
Lincoln Continental 10.4   8  0
Chrysler Imperial   14.7   8  0
Fiat 128            32.4   4  1
Honda Civic         30.4   4  1
Toyota Corolla      33.9   4  1
Toyota Corona       21.5   4  0
Dodge Challenger    15.5   8  0
AMC Javelin         15.2   8  0
Camaro Z28          13.3   8  0
Pontiac Firebird    19.2   8  0
Fiat X1-9           27.3   4  1
Porsche 914-2       26.0   4  1
Lotus Europa        30.4   4  1
Ford Pantera L      15.8   8  1
Ferrari Dino        19.7   6  1
Maserati Bora       15.0   8  1
Volvo 142E          21.4   4  1

We can also drop variables with the -:

mtcars %>% 
  select(-vs)
                     mpg cyl  disp  hp drat    wt  qsec am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1    4    2
                    amFactor
Mazda RX4                  1
Mazda RX4 Wag              1
Datsun 710                 1
Hornet 4 Drive             0
Hornet Sportabout          0
Valiant                    0
Duster 360                 0
Merc 240D                  0
Merc 230                   0
Merc 280                   0
Merc 280C                  0
Merc 450SE                 0
Merc 450SL                 0
Merc 450SLC                0
Cadillac Fleetwood         0
Lincoln Continental        0
Chrysler Imperial          0
Fiat 128                   1
Honda Civic                1
Toyota Corolla             1
Toyota Corona              0
Dodge Challenger           0
AMC Javelin                0
Camaro Z28                 0
Pontiac Firebird           0
Fiat X1-9                  1
Porsche 914-2              1
Lotus Europa               1
Ford Pantera L             1
Ferrari Dino               1
Maserati Bora              1
Volvo 142E                 1

We also have several helper functions that we can use:

HolzingerSwineford1939 %>% 
  select(num_range("x", 1:9), starts_with("age"), 
         matches("^s.*.l$"))

Not Important, But Helpful

Changing variable position in R is a pain:

head(HolzingerSwineford1939[, c(1, 7:15, 2:6)])
  id       x1   x2    x3       x4   x5        x6       x7   x8       x9
1  1 3.333333 7.75 0.375 2.333333 5.75 1.2857143 3.391304 5.75 6.361111
2  2 5.333333 5.25 2.125 1.666667 3.00 1.2857143 3.782609 6.25 7.916667
3  3 4.500000 5.25 1.875 1.000000 1.75 0.4285714 3.260870 3.90 4.416667
4  4 5.333333 7.75 3.000 2.666667 4.50 2.4285714 3.000000 5.30 4.861111
5  5 4.833333 4.75 0.875 2.666667 4.00 2.5714286 3.695652 6.30 5.916667
6  6 5.333333 5.00 2.250 1.000000 3.00 0.8571429 4.347826 6.65 7.500000
  sex ageyr agemo  school grade
1   1    13     1 Pasteur     7
2   2    13     7 Pasteur     7
3   2    13     1 Pasteur     7
4   1    13     2 Pasteur     7
5   2    12     2 Pasteur     7
6   2    14     1 Pasteur     7
HolzingerSwineford1939 %>% 
  select(id, starts_with("x"), everything()) %>% 
  head()
  id       x1   x2    x3       x4   x5        x6       x7   x8       x9
1  1 3.333333 7.75 0.375 2.333333 5.75 1.2857143 3.391304 5.75 6.361111
2  2 5.333333 5.25 2.125 1.666667 3.00 1.2857143 3.782609 6.25 7.916667
3  3 4.500000 5.25 1.875 1.000000 1.75 0.4285714 3.260870 3.90 4.416667
4  4 5.333333 7.75 3.000 2.666667 4.50 2.4285714 3.000000 5.30 4.861111
5  5 4.833333 4.75 0.875 2.666667 4.00 2.5714286 3.695652 6.30 5.916667
6  6 5.333333 5.00 2.250 1.000000 3.00 0.8571429 4.347826 6.65 7.500000
  sex ageyr agemo  school grade
1   1    13     1 Pasteur     7
2   2    13     7 Pasteur     7
3   2    13     1 Pasteur     7
4   1    13     2 Pasteur     7
5   2    12     2 Pasteur     7
6   2    14     1 Pasteur     7

Your Turn!

  1. Use that Stata test file.

  2. Grab every lvi, effect, leader, and cred variable.

  3. Use summary to understand your data.

  4. Now, just keep every lvi variable.

  5. Use a corrplot to see relationships.

    • corrplot needs a correlation matrix (use cor)
# Just to give you an idea about how it works!

install.packages("corrplot")

data.frame(x = rnorm(10), y = rnorm(10)) %>% 
  cor() %>% 
  corrplot()

Subsetting/Filtering

One of the more frequent tasks is related to filtering/subsetting your data. You often want to impose some types of rules on your data (e.g., US only, date ranges).

Base

R gives us all the ability in the world to filter data.

summary(mtcars[mtcars$mpg < mean(mtcars$mpg), ])
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :6.000   Min.   :145.0   Min.   :105.0  
 1st Qu.:14.78   1st Qu.:8.000   1st Qu.:275.8   1st Qu.:156.2  
 Median :15.65   Median :8.000   Median :311.0   Median :180.0  
 Mean   :15.90   Mean   :7.556   Mean   :313.8   Mean   :191.9  
 3rd Qu.:18.02   3rd Qu.:8.000   3rd Qu.:360.0   3rd Qu.:226.2  
 Max.   :19.70   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :2.770   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.070   1st Qu.:3.440   1st Qu.:16.10   1st Qu.:0.0000  
 Median :3.150   Median :3.570   Median :17.35   Median :0.0000  
 Mean   :3.302   Mean   :3.839   Mean   :17.10   Mean   :0.1667  
 3rd Qu.:3.600   3rd Qu.:3.844   3rd Qu.:17.94   3rd Qu.:0.0000  
 Max.   :4.220   Max.   :5.424   Max.   :20.22   Max.   :1.0000  
       am              gear            carb       amFactor
 Min.   :0.0000   Min.   :3.000   Min.   :1.000   0:15    
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.250   1: 3    
 Median :0.0000   Median :3.000   Median :4.000           
 Mean   :0.1667   Mean   :3.444   Mean   :3.556           
 3rd Qu.:0.0000   3rd Qu.:3.750   3rd Qu.:4.000           
 Max.   :1.0000   Max.   :5.000   Max.   :8.000           

Unless you know exactly what you are doing, this is a bit hard to read – you might be asking yourself what the comma means and why there is nothing after it.

dplyr

When we use filter, we are specifying what it is that we want to keep.

Keep this or that:

mtcars %>% 
  filter(cyl == 4 | cyl == 8) %>% 
  summary()
      mpg             cyl            disp             hp       
 Min.   :10.40   Min.   :4.00   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.20   1st Qu.:4.00   1st Qu.:120.1   1st Qu.: 93.0  
 Median :18.70   Median :8.00   Median :275.8   Median :150.0  
 Mean   :20.19   Mean   :6.24   Mean   :244.0   Mean   :153.5  
 3rd Qu.:24.40   3rd Qu.:8.00   3rd Qu.:351.0   3rd Qu.:205.0  
 Max.   :33.90   Max.   :8.00   Max.   :472.0   Max.   :335.0  
      drat            wt             qsec             vs     
 Min.   :2.76   Min.   :1.513   Min.   :14.50   Min.   :0.0  
 1st Qu.:3.08   1st Qu.:2.320   1st Qu.:16.90   1st Qu.:0.0  
 Median :3.69   Median :3.435   Median :17.60   Median :0.0  
 Mean   :3.60   Mean   :3.245   Mean   :17.81   Mean   :0.4  
 3rd Qu.:4.08   3rd Qu.:3.780   3rd Qu.:18.61   3rd Qu.:1.0  
 Max.   :4.93   Max.   :5.424   Max.   :22.90   Max.   :1.0  
       am           gear           carb      amFactor
 Min.   :0.0   Min.   :3.00   Min.   :1.00   0:15    
 1st Qu.:0.0   1st Qu.:3.00   1st Qu.:2.00   1:10    
 Median :0.0   Median :3.00   Median :2.00           
 Mean   :0.4   Mean   :3.64   Mean   :2.64           
 3rd Qu.:1.0   3rd Qu.:4.00   3rd Qu.:4.00           
 Max.   :1.0   Max.   :5.00   Max.   :8.00           

Keep this and that:

mtcars %>% 
  filter(cyl == 4 & mpg > 25) %>% 
  summary()
      mpg             cyl         disp              hp        
 Min.   :26.00   Min.   :4   Min.   : 71.10   Min.   : 52.00  
 1st Qu.:28.07   1st Qu.:4   1st Qu.: 76.45   1st Qu.: 65.25  
 Median :30.40   Median :4   Median : 78.85   Median : 66.00  
 Mean   :30.07   Mean   :4   Mean   : 86.65   Mean   : 75.50  
 3rd Qu.:31.90   3rd Qu.:4   3rd Qu.: 91.08   3rd Qu.: 84.75  
 Max.   :33.90   Max.   :4   Max.   :120.30   Max.   :113.00  
      drat             wt             qsec             vs        
 Min.   :3.770   Min.   :1.513   Min.   :16.70   Min.   :0.0000  
 1st Qu.:4.080   1st Qu.:1.670   1st Qu.:17.30   1st Qu.:1.0000  
 Median :4.150   Median :1.885   Median :18.71   Median :1.0000  
 Mean   :4.252   Mean   :1.873   Mean   :18.40   Mean   :0.8333  
 3rd Qu.:4.378   3rd Qu.:2.089   3rd Qu.:19.33   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :2.200   Max.   :19.90   Max.   :1.0000  
       am         gear            carb     amFactor
 Min.   :1   Min.   :4.000   Min.   :1.0   0:0     
 1st Qu.:1   1st Qu.:4.000   1st Qu.:1.0   1:6     
 Median :1   Median :4.000   Median :1.5           
 Mean   :1   Mean   :4.333   Mean   :1.5           
 3rd Qu.:1   3rd Qu.:4.750   3rd Qu.:2.0           
 Max.   :1   Max.   :5.000   Max.   :2.0           

Filter this out:

mtcars %>% 
  filter(cyl != 4) %>% 
  summary()
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :6.000   Min.   :145.0   Min.   :105.0  
 1st Qu.:15.00   1st Qu.:6.000   1st Qu.:225.0   1st Qu.:123.0  
 Median :16.40   Median :8.000   Median :301.0   Median :175.0  
 Mean   :16.65   Mean   :7.333   Mean   :296.5   Mean   :180.2  
 3rd Qu.:19.20   3rd Qu.:8.000   3rd Qu.:360.0   3rd Qu.:215.0  
 Max.   :21.40   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :2.620   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.070   1st Qu.:3.435   1st Qu.:16.46   1st Qu.:0.0000  
 Median :3.150   Median :3.520   Median :17.30   Median :0.0000  
 Mean   :3.348   Mean   :3.705   Mean   :17.17   Mean   :0.1905  
 3rd Qu.:3.730   3rd Qu.:3.840   3rd Qu.:17.98   3rd Qu.:0.0000  
 Max.   :4.220   Max.   :5.424   Max.   :20.22   Max.   :1.0000  
       am              gear            carb       amFactor
 Min.   :0.0000   Min.   :3.000   Min.   :1.000   0:16    
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000   1: 5    
 Median :0.0000   Median :3.000   Median :4.000           
 Mean   :0.2381   Mean   :3.476   Mean   :3.476           
 3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:4.000           
 Max.   :1.0000   Max.   :5.000   Max.   :8.000           

Naturally, it can also take a function:

mtcars %>% 
  filter(mpg < mean(mpg)) %>% 
  summary()
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :6.000   Min.   :145.0   Min.   :105.0  
 1st Qu.:14.78   1st Qu.:8.000   1st Qu.:275.8   1st Qu.:156.2  
 Median :15.65   Median :8.000   Median :311.0   Median :180.0  
 Mean   :15.90   Mean   :7.556   Mean   :313.8   Mean   :191.9  
 3rd Qu.:18.02   3rd Qu.:8.000   3rd Qu.:360.0   3rd Qu.:226.2  
 Max.   :19.70   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :2.770   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.070   1st Qu.:3.440   1st Qu.:16.10   1st Qu.:0.0000  
 Median :3.150   Median :3.570   Median :17.35   Median :0.0000  
 Mean   :3.302   Mean   :3.839   Mean   :17.10   Mean   :0.1667  
 3rd Qu.:3.600   3rd Qu.:3.844   3rd Qu.:17.94   3rd Qu.:0.0000  
 Max.   :4.220   Max.   :5.424   Max.   :20.22   Max.   :1.0000  
       am              gear            carb       amFactor
 Min.   :0.0000   Min.   :3.000   Min.   :1.000   0:15    
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.250   1: 3    
 Median :0.0000   Median :3.000   Median :4.000           
 Mean   :0.1667   Mean   :3.444   Mean   :3.556           
 3rd Qu.:0.0000   3rd Qu.:3.750   3rd Qu.:4.000           
 Max.   :1.0000   Max.   :5.000   Max.   :8.000           

Your Turn

For now, we are going to stick with that stataExample data.

  1. Select the same variables, but also include Rater.

  2. Filter the data on Rater – check the values and filter both ways.

  3. Now check those correlations again!

  4. Throw the Gender variable in and filter on that.

New Variables and Recoding

Base

Adding a new variable in base R is as easy as the following:

mtcars$roundedMPG = round(mtcars$mpg)

dplyr

If, however, we want to do things in a tidy chunk, we need to use mutate.

mtcars = mtcars %>% 
  mutate(roundedMPG = round(mpg))

There is also transmute. Can anyone venture a guess as to what it might do?
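If you want to check your guess, here is a minimal sketch using the built-in mtcars data: mutate keeps every existing column and appends the new one, while transmute returns only the variables you name.

```r
library(dplyr)

# mutate() keeps every existing column and appends the new one
mutated = mtcars %>% 
  mutate(roundedMPG = round(mpg))

# transmute() returns only the variables you name
transmuted = mtcars %>% 
  transmute(roundedMPG = round(mpg))

ncol(mutated)    # one more column than mtcars
ncol(transmuted) # just the one
```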

Base Recoding

You will need to recode variables at some point. Depending on the nature of the recode, it can be easy (e.g., to reverse code a scale, you just subtract every value from the max value + 1).
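As a quick worked example of that reverse-coding trick: on a 1-to-5 scale, the max value + 1 is 6, so subtracting each response from 6 flips the scale. This sketch assumes the full 1-to-5 range actually appears in the data; otherwise, hard-code the 6 rather than relying on max().

```r
# A hypothetical 1-5 Likert response set
responses = c(1, 2, 5, 4, 3)

# Reverse code: subtract each value from (max value + 1)
reversed = (max(responses) + 1) - responses

reversed
# [1] 5 4 1 2 3
```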

You will need to do some more elaborate stuff:

mtcars$mpgLoHi = 0

mtcars$mpgLoHi[mtcars$mpg > median(mtcars$mpg)] = 1
mtcars$mpgLoHi = ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

These are pretty good ways to do recoding of this nature, but what about this:

mtcars$vs[which(mtcars$vs == 0)] = "v"

mtcars$vs[which(mtcars$vs == 1)] = "s"

Or this:

mtcars$vs = ifelse(mtcars$vs == 0, "v", "s")

dplyr recoding

recode(mtcars$vs, `0` = "v", `1` = "s")
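In a pipeline, recode typically slots straight into a mutate call. A minimal sketch, starting from a fresh copy of mtcars where vs is still 0/1:

```r
library(dplyr)

# recode() inside mutate(): overwrite vs with its labeled version
mtcars %>% 
  mutate(vs = recode(vs, `0` = "v", `1` = "s")) %>% 
  count(vs)
```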

Your Turn!

  1. For the sake of demonstration, select only the first 10 lvi variables and everything else.

  2. Keep only observations with Rater == 0.

  3. Assume that the first 5 lvi variables (01 through 05) are scores for one assessment and the next five (06 through 10) are scores for another assessment.

  4. Create two new variables to capture the mean of those scores.

  • You will need to use the rowwise function ahead of mutate.

  • You can use the mean function, but you will have to wrap the variables in c()

# Just to help you along!

data.frame(x = rnorm(10), y = rnorm(10)) %>% 
  rowwise() %>% 
  mutate(test = mean(c(x, y)))

Communication

We won’t have any big end-of-day wrap exercises to do today. Instead, we are going to learn just a few cool things.

ggplot2

We already saw some ggplot2, but let’s take a few minutes to dive into it a bit more.

Just like everything else in the tidyverse, ggplot2 provides a clear and consistent grammar, except the focus is on data visualization. With ggplot2, we can stack layer after layer into the plotting space to help visualize our data.

Let’s take a look at some good ggplot2 layering:

library(ggplot2)

library(lavaan)

testData = HolzingerSwineford1939

ggplot(testData, aes(x7, ageyr)) +
  geom_point()

Next, we can add some color:

ggplot(testData, aes(x7, ageyr)) +
  geom_point(aes(color = as.factor(grade)), alpha = .75)

Now, we can add a smooth line:

ggplot(testData, aes(x7, ageyr)) +
  geom_point(aes(color = as.factor(grade)), alpha = .75) + 
  geom_smooth()

And we can look at small multiples:

ggplot(testData, aes(x7, ageyr)) +
  geom_point(aes(color = as.factor(grade)), alpha = .75) + 
  geom_smooth() +
  facet_grid(~ sex)

Let’s get those silly grey boxes out of there:

ggplot(testData, aes(x7, ageyr)) +
  geom_point(aes(color = as.factor(grade)), alpha = .75) + 
  geom_smooth() +
  facet_grid(~ sex) +
  theme_minimal()

Perhaps add a better color scheme:

ggplot(testData, aes(x7, ageyr)) +
  geom_point(aes(color = as.factor(grade)), alpha = .75) + 
  geom_smooth() +
  facet_grid(~ sex) +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2")

We could keep going forever and tweak anything that you could imagine (labels, ticks, etc.), but this should give you a pretty good idea about what you can do with regard to static plots.

Oh…but we don’t have to stick with just static plots. We can use the plotly package to make our ggplot object interactive.

library(plotly)

radPlot = ggplot(testData, aes(x7, ageyr)) +
  geom_point(aes(color = as.factor(grade)), alpha = .75) + 
  geom_smooth() +
  facet_grid(~ sex) +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2")

ggplotly(radPlot)

You can also build plots directly with plotly, but we will save that for another day.

Learning to use ggplot2 will pay great dividends – there is absolutely nothing better for creating visualizations. There is even a whole group of packages that do nothing but add stuff into it.

DT

Visualizations are great and they often tell a better story than tables. Sometimes, though, you want to give people a glimpse of the data. The DT package lets you create interactive data tables (they are JavaScript DataTables under the hood).

You could give people the entire data to explore:

library(DT)

datatable(testData)

You can also pair the DT package with broom to turn your model summaries into a nice interactive table:

lm(x7 ~ ageyr + school, data = testData) %>% 
  broom::tidy() %>% 
  mutate_if(is.numeric, round, 4) %>% 
  datatable()

We don’t want to get too far ahead of ourselves here – we will see more places to use this tomorrow.

R Markdown & Knitr

Do you have a moment to hear the good word of Donald Knuth? If you want to work in a reproducible fashion, R Markdown and knitr are here to help you out. The slides you saw earlier and even the document you are seeing now are all done with R Markdown. It is my hope that you will also use R Markdown for your presentations on Thursday.

Day 1 Thought Question

Since we used the Stata stuff, let’s keep rolling with that. The Rater variable indicates whether the person is a supervisor (0) or a subordinate (3). Since this data comes from a bigger set, this coding might make sense there, but it makes no sense for the data at hand. I don’t believe that you would ever do this, but someone wants the leader_age variable discretized into two groups – below or at the mean and above the mean. The same goes for leader_tenure and leader_experience. In addition to these changes, someone is nervous about having both raterNum and leaderID available in the data; they are requesting that at least one of them be removed.

We have a few distinct issues to address within this data – what would you propose that we do?

Summarizing And Grouping

If we recall, we already saw a little bit of grouping and merging (if you don’t, think back to that mess with aggregate). Given that we already saw aggregate, we will just dive right into the tidyverse.

dplyr

Grouping data and comparing various summary statistics by group is a common task. Sometimes it is just a means of exploration and sometimes it will actually answer the question. No matter the need, you will likely find it quite simple.

library(dplyr)

mtcars %>% 
  summarize(meanMPG = mean(mpg), 
            meanSD = sd(mpg))
   meanMPG   meanSD
1 20.09062 6.026948

You can even summarize all of your variables in a handy way.

mtcars %>% 
  summarize_all(funs(mean, sd), na.rm = TRUE)
  mpg_mean cyl_mean disp_mean  hp_mean drat_mean wt_mean qsec_mean vs_mean
1 20.09062   6.1875  230.7219 146.6875  3.596563 3.21725  17.84875  0.4375
  am_mean gear_mean carb_mean amFactor_mean roundedMPG_mean   mpg_sd
1 0.40625    3.6875    2.8125            NA              20 6.026948
    cyl_sd  disp_sd    hp_sd   drat_sd     wt_sd  qsec_sd     vs_sd
1 1.785922 123.9387 68.56287 0.5346787 0.9784574 1.786943 0.5040161
      am_sd   gear_sd carb_sd amFactor_sd roundedMPG_sd
1 0.4989909 0.7378041  1.6152   0.4989909      6.010743

Because we are dealing with the tidyverse, variable selection is included.

mtcars %>% 
  summarize_at(vars(starts_with("c")), 
               funs(mean, sd), na.rm = TRUE)
  cyl_mean carb_mean   cyl_sd carb_sd
1   6.1875    2.8125 1.785922  1.6152
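A heads-up: funs() was soft-deprecated in later dplyr releases. If you are on dplyr 1.0 or newer, the same summary is written with across(); a sketch, assuming a recent dplyr:

```r
library(dplyr)

# across() replacement for summarize_at(vars(starts_with("c")), funs(mean, sd))
mtcars %>% 
  summarize(across(starts_with("c"), 
                   list(mean = function(x) mean(x, na.rm = TRUE), 
                        sd = function(x) sd(x, na.rm = TRUE))))
```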

Combining group_by with summarize welcomes even more power to summarize data.

mtcars %>% 
  group_by(am) %>% 
  summarize(meanMPG = mean(mpg), 
            sdMPG = sd(mpg))
# A tibble: 2 x 3
     am meanMPG sdMPG
  <dbl>   <dbl> <dbl>
1  0       17.1  3.83
2  1.00    24.4  6.17

You are not limited to single group_by statements!
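For example, passing two variables to group_by gives one summary row per am/cyl combination (the .groups argument assumes dplyr 1.0 or newer; drop it on older versions):

```r
library(dplyr)

# Grouping on two variables at once
mtcars %>% 
  group_by(am, cyl) %>% 
  summarize(meanMPG = mean(mpg), 
            n = n(), 
            .groups = "drop")
```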

Your Turn

  1. Use the stataData again:

stataExample = haven::read_dta(file = "https://www3.nd.edu/~sberry5/data/stataExample.dta")

  2. Check out the data names and find ones that might be suitable for grouping.

    • Gender, leaderID, and a few others might stick out
  3. Pick a variable to summarize and some type of summary statistic.

    • mean() and sd() are both easy, but be mindful of NAs

Reshaping

Now, things are going to get weird.

Data can take many different forms.

We can have data that looks like this:

wideDat = data.frame(id = 1:3, 
                     age = c(33, 35, 37), 
                     employeeType = c("full", "full", "part"))

Or like this:

wideDat = data.frame(id = rep(1:3, times = 2), 
                     variable = rep(c("age", "employeeType"), each = 3), 
                     value = c(33, 35, 37, 
                               "full", "full", "part"))

The first type is what many will recognize as standard tabular data. Each row represents an observation, each column is a variable, and each “cell” holds one value.

The second type, long data, is what many will call key-value pairs. You will often see data like this in timeseries data.

You will encounter people who will swear that one way or the other is the ideal way to represent data – we are going to opt for pragmatic as opposed to dogmatic. We can easily switch between these two types of data representations – this is called reshaping.

There is a bit of a hierarchy in R with regard to reshaping data. The reshape function in the stats package can handle most of your needs, but the resulting data is a bit on the ugly side (bad default row names, weird automatic column names, and a bunch of arguments). The reshape package gives you all of the power, but with clearer code and better output. The reshape2 package has all of the same power, with some added functionality. The tidyr package makes things incredibly easy, but at the expense of some flexibility.

Base/stats

The following chunk of code needs the as.data.frame(). Why, you might ask? Almost everything in dplyr converts data to a tibble. Many base R functions will go crazy when they encounter a tibble, so you need to explicitly make it a data frame. You might ask what is the trouble with tibbles (anyone?)…

library(ggplot2)

data("starwars")

as.data.frame(starwars) %>% 
  filter(species == "Human" & grepl("(Skywalker)|(Rey)|(Vader)|(Kylo)", .$name)) %>% 
  select(name, height, mass) %>% 
  reshape(., idvar = "name", v.names = "values", varying = list(2:3), 
          times = c("height", "mass"), direction = "long") %>% 
  ggplot(., aes(x = name, y = values, color = time)) + 
  geom_point(size = 3.5) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

reshape

Let’s use the reshape package to do the same thing. You are going to notice a few differences in the function arguments. The reshape package has this notion of melting (going from wide to long) and casting (going from long to wide). Melting makes plenty of sense to me, but I can only imagine what casting means.

starwars %>% 
  as.data.frame() %>% 
  filter(species == "Human" & 
           grepl("(Skywalker)|(Rey)|(Vader)|(Kylo)", 
                 .$name)) %>% 
  select(name, height, mass) %>% 
  reshape::melt.data.frame(., id.vars = "name", 
                           measure.vars = 2:3, 
                           variable_name = "type", na.rm = TRUE) %>% 
  ggplot(., aes(x = name, y = value, color = type)) + 
  geom_point(size = 3.5) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

Reshape introduced the vernacular, but I really do not see a reason to use it anymore.

reshape2

We don’t need to worry about the tibble issue with reshape2!

starwars %>% 
  filter(species == "Human" & grepl("(Skywalker)|(Rey)|(Vader)|(Kylo)", .$name)) %>% 
  select(name, height, mass) %>% 
  reshape2::melt(., id.vars = "name", 
                           measure.vars = 2:3, variable.name = "type", 
                           value.name = "value", na.rm = TRUE) %>% 
  ggplot(., aes(x = name, y = value, color = type)) + 
  geom_point(size = 3.5) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

tidyr

Allows for dplyr variable selection and a little bit more clarity with creating the id(s) variables.

starwars %>% 
  filter(species == "Human" & grepl("(Skywalker)|(Rey)|(Vader)|(Kylo)", .$name)) %>% 
  select(name, height, mass) %>% 
  tidyr::gather(., key = type, value = value, -name) %>% 
  ggplot(., aes(x = name, y = value, color = type)) + 
  geom_point(size = 3.5) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

The complementary function to gather is spread:

library(tidyr)

starwarsLong = starwars %>% 
  filter(species == "Human" & grepl("(Skywalker)|(Rey)|(Vader)|(Kylo)", .$name)) %>% 
  select(name, height, mass) %>% 
  gather(., key = type, value = value, -name)

starwarsLong
# A tibble: 10 x 3
   name             type   value
   <chr>            <chr>  <dbl>
 1 Luke Skywalker   height 172  
 2 Darth Vader      height 202  
 3 Anakin Skywalker height 188  
 4 Shmi Skywalker   height 163  
 5 Rey              height  NA  
 6 Luke Skywalker   mass    77.0
 7 Darth Vader      mass   136  
 8 Anakin Skywalker mass    84.0
 9 Shmi Skywalker   mass    NA  
10 Rey              mass    NA  
starwarsLong %>% 
  spread(., key = type, value = value)
# A tibble: 5 x 3
  name             height  mass
  <chr>             <dbl> <dbl>
1 Anakin Skywalker    188  84.0
2 Darth Vader         202 136  
3 Luke Skywalker      172  77.0
4 Rey                  NA  NA  
5 Shmi Skywalker      163  NA  

In addition to reshaping, tidyr has some handy functions for splitting (separate) and pasting (unite) columns.
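A quick sketch of those two on a toy data frame: unite pastes columns together, and separate splits them back apart.

```r
library(tidyr)

df = data.frame(year = 2003, month = 7)

# unite() pastes the columns into one...
united = unite(df, col = "date", year, month, sep = "-")
united

# ...and separate() splits them back apart (into character columns)
separate(united, col = "date", into = c("year", "month"), sep = "-")
```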

Others

While we won’t dive into it, the splitstackshape package is very handy for reshaping. It also has additional uses for column manipulations.

Your Turn

Merging

Now we are playing with power! Having multiple datasets in memory is one of R’s strong points (not everything can manage such a modern feat). Once you get out there, this becomes important.

Not only can we have multiple datasets open, but we can also merge those datasets together, with the proper variables, of course.

base

The merge function in base R, like everything else, can do us a great amount of good.

## DO NOT RUN: a skeleton only – supply your own file paths and id variable

board = haven::read_sas()

organization = haven::read_sas()

mergedDat = merge(x = board, y = organization, by = "", 
      all.x = TRUE, all.y = FALSE)

If there is anything good to be gotten from SQL, it is the notion of different joins and the handy language that it provides for specifying those joins. The merge function gives us no such explicit conventions (we would need to intuit or…read the documentation).

Simulated Merryment

Live And Onstage!

Left join = all rows from x and all columns from x and y

Right join = all rows from y and all columns from x and y

Inner join = all rows from x with matching values in y and all columns from x and y

Semi join = all rows from x with matching values in y and just columns from x

Full join = everything

With that knowledge, can we map the various combinations of all.x and all.y?
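One possible mapping (worth verifying against ?merge), demonstrated on two tiny hypothetical data frames:

```r
x = data.frame(id = 1:3, a = c("a1", "a2", "a3"))
y = data.frame(id = 2:4, b = c("b2", "b3", "b4"))

# left join:  all.x = TRUE,  all.y = FALSE
# right join: all.x = FALSE, all.y = TRUE
# inner join: all.x = FALSE, all.y = FALSE (merge's default)
# full join:  all = TRUE (shorthand for setting both)

nrow(merge(x, y, by = "id"))                # inner: only ids 2 and 3
nrow(merge(x, y, by = "id", all.x = TRUE))  # left: every row of x
nrow(merge(x, y, by = "id", all = TRUE))    # full: ids 1 through 4
```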

Left

merge1 = haven::read_dta("https://www3.nd.edu/~sberry5/data/merge1Company.dta")

sasExample = haven::read_sas("https://www3.nd.edu/~sberry5/data/wciklink_gvkey.sas7bdat")

leftTest = left_join(merge1, sasExample, by = "gvkey")

If we want to join on multiple columns, we could provide a character vector:

leftTestMultiple = left_join(merge1, sasExample, by = c("gvkey", "coname"))

If our names don’t match, we need to provide both:

leftTestEqual = left_join(merge1, sasExample, by = c("gvkey", 
                                                "coname", 
                                                "datadate" = "DATADATE1"))

How did this one work? Always check your data!

Inner

innerTest = inner_join(merge1, sasExample, by = c("gvkey"))

Semi

semiTest = semi_join(merge1, sasExample, by = c("gvkey"))

Full

fullTest = full_join(merge1, sasExample, by = c("gvkey"))

Anti

I didn’t mention the anti join before! It does exactly what it sounds like – it finds the things that don’t match. A natural curiosity is the potential purpose for such a function. Can anyone think of anything?

antiTest = anti_join(merge1, sasExample, by = c("gvkey"))

Your Turn!

Let’s look at these four files:

merge1 = "https://www3.nd.edu/~sberry5/data/merge1Company.dta"

merge2Hoberg = "https://www3.nd.edu/~sberry5/data/merge2Hoberg.txt"

merge3McDonald = "https://www3.nd.edu/~sberry5/data/merge3McDonald.csv"

sasExample = "https://www3.nd.edu/~sberry5/data/wciklink_gvkey.sas7bdat"
  1. Read those files in appropriately (look at the file extensions…or rio).
  2. Start merging them together in any way that you can.

Chained merges look like this:

## DO NOT RUN:

left_join(data1, data2, by = "id") %>% 
  left_join(., data3, by = "id") %>% 
  left_join(., data4, by = "id")

Binding

On more than one occasion, you will want to bring data together in a “stacked” manner.

Imagine you have two data files that look exactly alike with regard to column names, but the values are different. This is when we could use a row bind:

data2003 = read.csv("https://www3.nd.edu/~sberry5/data/c2003_a.csv")

data2004 = read.csv("https://www3.nd.edu/~sberry5/data/c2004_a.csv")

# data2013 = read.csv("https://www3.nd.edu/~sberry5/data/")
  
complete = rbind(data2003, data2004)

What if our rows were the same, but we wanted to add some columns? You said cbind, no doubt!
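A minimal cbind sketch with two made-up data frames; note that cbind simply bolts the columns on side by side, with no id matching.

```r
demographics = data.frame(id = 1:3, age = c(33, 35, 37))
scores = data.frame(score1 = c(4, 5, 3), score2 = c(2, 5, 4))

# cbind() assumes the rows already line up in the same order!
combined = cbind(demographics, scores)

combined
```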

Data Wrangling?

This is a point where we should revisit the term data wrangling. It makes sense conceptually, but it casts a certain mental image that might be limiting. What we have seen up to this point should make it abundantly clear that we are in control of our data – this sits nicely with wrangling. What might not be so clear is the artistically forceful way that we sometimes need to make our data behave. Instead, we might want to think of ourselves as Data Picassos. Data preparation is often done through a series of data deconstructions – much like making a collage. We take bits and pieces from various places and then put them together to make something coherent. This also sits nicely with our previous discussion on code golf.

Therefore, we need to learn to accept a default frame of reference that allows us to break things down into smaller pieces. We are not bound to any monolith.

Keep this concept of data collaging in your mind.

String Cleaning

Data has strings…it is a simple fact of modern data.

If you can clean strings, you can conquer any data task that gets thrown at you. To clean strings, though, you will need to learn how to use magic!

Regular Expressions

Regular expressions (regex) are wild. Regex’s purpose is to match patterns in strings.

Of everything that we have and will see, regex is something that you can use in places outside of data.

Some regular expressions are very easy to understand (once you know what they mean): [A-Za-z]+

Others take some intense trial and error: \([0-9]{3}.[0-9]{3}.*[0-9]{4}

Learning just a little and being able to use them in a variety of settings is most helpful.

Learning regular expressions in R is a bit tough, so let’s go here: regexr.com

stringr

What is the difference between sub and gsub?

What is the difference between grep and grepl?

Why did grep just return a bunch of numbers?

What does the following do: “^\s+|\s+$”

For the love of all that is good, what does regexpr do?

These are just a few of the questions that will come up when working with strings in base R.
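A quick sketch answering a few of them: grep returns the indices of matches (the “bunch of numbers”), grepl returns logicals, sub replaces only the first match in each element, and gsub replaces them all.

```r
x = c("data", "strings", "data wrangling")

grep("data", x)     # indices of the matching elements
grepl("data", x)    # one TRUE/FALSE per element
sub("a", "_", x)    # only the first "a" in each element is replaced
gsub("a", "_", x)   # every "a" is replaced

# And the mystery pattern: it trims leading/trailing whitespace
gsub("^\\s+|\\s+$", "", "   padded   ")
```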

There is also the issue of mixed arguments. Consider grep and gsub.

realComments = c("I love wrangling data", "stringz r fun", 
                 "This guy is a hack", "Can't we use excel?")

grep(pattern = "\\b[a-z]{2}\\b", x = realComments, value = TRUE)
[1] "This guy is a hack"  "Can't we use excel?"
gsub(pattern = "(hack)", replacement = "star", x = realComments)
[1] "I love wrangling data" "stringz r fun"         "This guy is a star"   
[4] "Can't we use excel?"  

It is pretty subtle, but the argument order can be a bit troublesome when you are just learning or have not used them in a while.

Check these out:

library(stringr)

str_subset(string = realComments, pattern = "\\b[a-z]{2}\\b")
[1] "This guy is a hack"  "Can't we use excel?"
str_replace_all(string = realComments, pattern = "(hack)", 
                replacement = "star")
[1] "I love wrangling data" "stringz r fun"         "This guy is a star"   
[4] "Can't we use excel?"  

We now have consistent argument order and very clear names.

Clear names and consistent arguments aside, stringr also simplifies some previously cumbersome processes.

matchedComments = regexpr(pattern = "love|like|enjoy", 
                          text = realComments)

regmatches(x = realComments, m = matchedComments)
[1] "love"

This becomes the following with stringr:

str_extract_all(string = realComments, 
                pattern = "love|like|enjoy")
[[1]]
[1] "love"

[[2]]
character(0)

[[3]]
character(0)

[[4]]
character(0)

These are cute examples, but how should we use these for actual data? I am sure you remember this bit of data from yesterday:

library(rvest)

highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>% 
  html_table(fill = TRUE) %>%
  magrittr::extract2(1)

Let’s look at the structure of this data:

str(highest)
'data.frame':   50 obs. of  6 variables:
 $ Rank           : chr  "1" "2" "3" "4" ...
 $ Peak           : chr  "1" "1" "3" "3" ...
 $ Title          : chr  "Avatar" "Titanic" "Star Wars: The Force Awakens" "Jurassic World" ...
 $ Worldwide gross: chr  "$2,787,965,087" "$2,187,463,944" "$2,068,223,624" "$1,671,713,208" ...
 $ Year           : int  2009 1997 2015 2015 2012 2015 2015 2011 2017 2013 ...
 $ Reference(s)   : chr  "[# 1][# 2]" "[# 3][# 4]" "[# 5][# 6]" "[# 7][# 8]" ...

Do you see any problems? If you made note of the character nature of “Worldwide gross”, you were astute. R doesn’t recognize dollars and commas as anything other than strings. We need to do some good tidy work here!

highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>% 
  html_table(fill = TRUE) %>%
  magrittr::extract2(1) %>% 
  mutate(gross = stringr::str_replace_all(.$`Worldwide gross`, "\\$|,|[A-Za-z].*", ""), 
         gross = as.numeric(gross))

We are saying to replace all instances in a string where we find \$, or a comma, or any letter followed by anything for 0 or more times.

If you have not worked with regular expressions before today, you might be wondering why there are two slashes in front of the dollar sign – they are escapes. In many regex engines, you just need one escape; in R, though, you need to escape the escape!
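A tiny demonstration of the double escape: the regex engine needs \$ (a bare $ means end-of-string), and the R string needs the backslash itself escaped, hence \\$.

```r
money = "$2,787,965,087"

# "\\$" reaches the regex engine as "\$": a literal dollar sign
gsub(pattern = "\\$|,", replacement = "", x = money)

# Then as.numeric() finishes the job
as.numeric(gsub("\\$|,", "", money))
```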

Did you notice “Peak” too? Why don’t you handle that one?

On Names, Regex, & Merging

Occasionally, you will want to merge or bind data, but the column names are very different. Merge/join gives us a way to counteract this, but binding does not.

When you get into those situations, you can clean up the names of the data with the same tools.

If it is just a matter of case mismatch, this works:

testDF = data.frame(camelCase = 1:10, 
                    normalName = 1:10, 
                    wHyGoDwHy = 1:10)

names(testDF) = stringr::str_to_lower(names(testDF))

You can also do some pattern stuff if needed:

testDF2 = data.frame(peopleDoThis7 = 1:10, 
                     andThis.8 = 1:10, 
                     andEvenThis_9 = 1:10)

names(testDF2) = stringr::str_replace_all(names(testDF2), "\\.|_|\\W", "") 

Let’s add another year into some data that we already saw:

data2003 = readr::read_csv("https://www3.nd.edu/~sberry5/data/c2003_a.csv")

data2004 = readr::read_csv("https://www3.nd.edu/~sberry5/data/c2004_a.csv")

data2013 = readr::read_csv("https://www3.nd.edu/~sberry5/data/c2013_a.csv")
  
complete = rbind(data2003, data2004)

## This will cause an error because of variable names!

complete = rbind(complete, data2013)

Does everything still look good?

Github

GitHub is an online hosting service for Git repositories. Git can be used purely as a local repository, but its greatest power is in collaboration.

Let’s take a quick look at what Github does and how we can use it for our benefit.

git config --global user.name "saberry"

git config --global user.email "seth.berry@nd.edu"

Day 2 Thought Question

I have given you several different financial files, each with a “cik” field that would presumably be used to match the files together in a way that keeps every column from both data sets. Unfortunately, though, you notice one file’s unique id field contains leading zeros. How would you handle this entire task?
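One hedged sketch of the leading-zero part, using made-up stand-ins for the real files: pad the numeric ids to a fixed width so both “cik” columns share the same format, then do a full merge.

```r
# Hypothetical stand-ins for the real files:
fileA = data.frame(cik = c(1234, 56789), assets = c(10, 20))
fileB = data.frame(cik = c("0001234", "0056789"), sales = c(5, 6))

# Pad the numeric ids with leading zeros to match the other file
fileA$cik = sprintf("%07d", fileA$cik)

# all = TRUE keeps every row and every column from both data sets
combined = merge(fileA, fileB, by = "cik", all = TRUE)
```

The width (7 here) is an assumption; in practice you would set it to the width of the zero-padded file's ids.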

Fuzzy Joins

We have seen joins and some character work, but now we are going to combine them into one world.

Whenever we use joins, we are generally looking for exact matches. In reality, exact matches are far from guaranteed.

Fuzzy joins allow us to use string distance metrics to join non-matching strings together.

We are going to need fuzzyjoin:

install.packages('devtools')

devtools::install_github("dgrtwo/fuzzyjoin")

String Distances

Not including spelling mistakes, there are many different ways to represent words; this is especially true when we are discussing companies.

Would AG Edwards and A.G. Edwards join? Of course not! They are the same company, but not the same words.

We have learned enough about string cleaning to know that we could tidy that one up, but what about something more subtle, like AG Edward and AG Edwards?

We can use the adist function to figure out the distance between these two strings.

string1 = "AG Edward"

string2 = "AG Edwards"

adist(x = string1, y = string2)
     [,1]
[1,]    1

The adist function uses generalized edit distance as the metric. This does things like calculate the number of characters that need to be inserted, deleted, or substituted to make a match.
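You can actually see those operations: passing counts = TRUE to adist attaches the insertion/deletion/substitution tallies behind the score.

```r
# counts = TRUE attaches an array of the edit operations used
d = adist("AG Edward", "AG Edwards", counts = TRUE)

attr(d, "counts")
# one insertion, no deletions, no substitutions
```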

If we are merging, generalized edit distance might not give us the level of granularity that we need for minimizing mismatches.

Let’s try out a few different strings below.

library(stringdist)

string3 = "A G Edwards"

stringdist(string1, string2, method = "jw")
[1] 0.03333333
stringdist(string2, string3, method = "jw")
[1] 0.06363636
stringdist(string1, string3, method = "jw")
[1] 0.0976431

And compare them to our standard edit distance:

adist(c(string1, string2, string3))
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    0    1
[3,]    2    1    0

We can see that we get a little finer scoring with the Jaro-Winkler distance.

Now let’s take a gander at Jaccard’s distance:

stringdist(string1, string2, method = "jaccard")
[1] 0.1111111
stringdist(string2, string3, method = "jaccard")
[1] 0
stringdist(string1, string3, method = "jaccard")
[1] 0.1111111

You might be wondering why we need so much granularity. Consider what happens between Safeway and Subway:

stringdist("safeway", "subway", method = "soundex")
[1] 0
stringdist("safeway", "subway", method = "jaccard")
[1] 0.5
stringdist("safeway", "subway", method = "jw")
[1] 0.2539683

I think we can imagine the consequences of using metrics like soundex or Jaccard distance to join strings together.

So, is there a good time to use something like soundex? Sure: use it when you need to cast a wide net for any possible word matches.

Fuzzy Joins

Let’s use these files:

library(data.table)

fuzzyJoinClients = fread("https://www3.nd.edu/~sberry5/data/fuzzyJoinClients.csv")

fuzzyJoinCompustat = fread("https://www3.nd.edu/~sberry5/data/fuzzyJoinCompustat.csv")

We are going to merge on “TIENAME” and “Company Name”, but they need some work.

fuzzyJoinClients$TIENAME = tolower(fuzzyJoinClients$TIENAME)

fuzzyJoinCompustat$Company.Name = tolower(fuzzyJoinCompustat$Company.Name)

Let’s try to merge them normally:

library(dplyr)

standardJoin = left_join(fuzzyJoinClients, fuzzyJoinCompustat, 
                         by = c("TIENAME" = "Company Name"))

Not too good, right?

fuzzyStuff = fuzzyjoin::stringdist_left_join(fuzzyJoinClients, fuzzyJoinCompustat, 
                         by = c("TIENAME" = "Company Name"), 
                         max_dist = .5, method = "jw", 
                         distance_col = "stringDistance")

Tips & Tricks

Row/Group Indices

When you are doing some type of data wrangling task (especially when things need to be grouped or merged), you might find a row or group index to be helpful.

library(dplyr)

carsID = mtcars %>% 
  mutate(rowID = 1:nrow(mtcars)) %>% 
  group_by(cyl) %>% 
  mutate(groupID = 1:n())

carsID
# A tibble: 32 x 15
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0  6.00   160 110    3.90  2.62  16.5  0     1.00  4.00  4.00
 2  21.0  6.00   160 110    3.90  2.88  17.0  0     1.00  4.00  4.00
 3  22.8  4.00   108  93.0  3.85  2.32  18.6  1.00  1.00  4.00  1.00
 4  21.4  6.00   258 110    3.08  3.22  19.4  1.00  0     3.00  1.00
 5  18.7  8.00   360 175    3.15  3.44  17.0  0     0     3.00  2.00
 6  18.1  6.00   225 105    2.76  3.46  20.2  1.00  0     3.00  1.00
 7  14.3  8.00   360 245    3.21  3.57  15.8  0     0     3.00  4.00
 8  24.4  4.00   147  62.0  3.69  3.19  20.0  1.00  0     4.00  2.00
 9  22.8  4.00   141  95.0  3.92  3.15  22.9  1.00  0     4.00  2.00
10  19.2  6.00   168 123    3.92  3.44  18.3  1.00  0     4.00  4.00
# ... with 22 more rows, and 4 more variables: amFactor <fct>,
#   roundedMPG <dbl>, rowID <int>, groupID <int>

Dates and Times

lubridate

Another glorious thing that will happen to you is working with dates – they tend to be a pain.

Sys.Date()
[1] "2018-03-08"
format(Sys.Date(), "%m-%d-%Y")
[1] "03-08-2018"

And you get things like this:

date1 = "12012018"

date2 = "20180112"

date3 = "12-01-2018"

date4 = "12/01/2018"

If you had to merge data based upon these dates, you would be in trouble.

You could, naturally, use some of the string cleaning stuff that we learned to tear them apart and then re-arrange them in a consistent manner.

dateStrings = strsplit(date3, split = "-")

paste(dateStrings[[1]][3], 
      dateStrings[[1]][2], dateStrings[[1]][1], 
      sep = "-")
[1] "2018-01-12"

If we wanted to go down the path of that previous chunk, we would need to pass the whole thing into an lapply.
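That lapply route might look something like this sketch, over a small made-up vector of dates in the same format:

```r
# The same split-and-rearrange idea, applied over several strings at once
dates = c("12-01-2018", "25-06-2017")

flipped = unlist(lapply(strsplit(dates, split = "-"), function(d) {
  paste(d[3], d[2], d[1], sep = "-")
}))

flipped
# [1] "2018-01-12" "2017-06-25"
```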

Or…we can just do this:

library(lubridate)

mdy(date3)
[1] "2018-12-01"

There is some really great stuff in lubridate. For instance, if you need to get time intervals:

exactAge = function(birthday, 
                    value = c("second", "minute", "hour", "day", 
                              "week", "month", "year")) {
  age = interval(ymd(birthday), ymd(Sys.Date()))  
  
  time_length(age, value)
}

Run that and give it your birthday in “YEAR-MO-DY” format. While it might be goofy at first glance, can we think of anything practical for it?
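As a usage sketch with fixed endpoints instead of Sys.Date() (so the answer is stable and reproducible), the interval/time_length machinery looks like this:

```r
library(lubridate)

# Fixed endpoints instead of Sys.Date() so the answer never changes
span = interval(ymd("2000-01-01"), ymd("2020-01-01"))

time_length(span, "year")
# [1] 20
```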

There is a similar package, hms, for dealing with time.

Lists

Lists are just a part of life at this point. You will see lists of data frames and you will see columns within data frames that contain lists.

If you recall our previous look at JSON, you will see that there is a variable that actually contains a list.

We also saw our starwars example:

glimpse(starwars)
Observations: 87
Variables: 13
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
$ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
$ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
$ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
$ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...

There would be a few ways to tackle such an enterprise:

movies = unique(unlist(starwars$films))

starwars = starwars %>% 
  mutate(revengeSith = ifelse(grepl(movies[grep("Sith", movies)], films), 
                              1, 0))
library(tidyr)

data(starwars)

starwars = starwars %>% 
  unnest(films) %>% 
  mutate(id = 1:nrow(.)) %>% 
  spread(key = films, value = films) %>% 
  group_by(name) %>% 
  fill(12:18, .direction = "up") %>% 
  fill(12:18, .direction = "down") %>% 
  slice(1)

Let’s turn back to the ticket option JSON data:

library(jsonlite)

jsonWhole = read_json("https://www3.nd.edu/~sberry5/data/optionsDataComplete.json", 
                      simplifyVector = TRUE)

What do we do with the list variable?

spreadData = lapply(1:nrow(jsonWhole), function(x) {
  
  res = jsonWhole$teamTiers[[x]] %>% 
    reshape2::melt(.) %>% 
    reshape2::dcast(., value ~ tierShortName + variable) %>% 
    select(-value) %>% 
    summarize_all(sum, na.rm = TRUE)
  
  return(res)
}) %>% 
  data.table::rbindlist(., fill = TRUE)

And now, we can bind everything back up:

jsonWhole = cbind(jsonWhole, spreadData) %>% 
  select(-teamTiers)

Back To The Basics

Functions

I genuinely hope you learned a lot from our time together; however, if you only learn one thing, I hope that you feel empowered to write your own functions. For all of R’s greatness, writing your own functions is what really makes it work so well.

You might not feel like you can write a function – we all feel that way at some point. You can!

Functions are simply objects that act upon other objects.

This is a simple function:

fido = seq(1, 10, by = 2)

sum(fido)

meanFunction = function(x) {
  xLength = length(x)
  
  res = sum(x) / xLength
  
  return(res)
}

Specify what you are passing into the function (x above).

R already has a built-in mean() function, but now you know how easy it is to do such things.

When working on data manipulation tasks, functions become very useful because you will find yourself doing the same thing a lot.

– If you find yourself doing something more than twice, write a function!

Your Turn

  1. Create a vector of something.

Something like the following:

x = 1:5

# or...

x = c(1:4, 6)

  2. Tweak your vector in some fashion:

x = x + 1

  3. Create a simple function to do something to your vector:

Maybe try a series of math operations.

testFunc = function(x) {
  res = sum(x * 3)
  return(res)
}

Apply Family

We have seen a lot of different things over the last few hours together. The final bit of wisdom to pass along is the apply family. The apply family (lapply, sapply, mapply, etc.) allows you to pass a function over something like a list and then return something predictable.

testNames = c("Jack", "Jill", "Frank", "Steve", "Harry", "Lloyd")

sapply(testNames, function(x) paste(x, "went over the hill", sep = " "))
                      Jack                       Jill 
 "Jack went over the hill"  "Jill went over the hill" 
                     Frank                      Steve 
"Frank went over the hill" "Steve went over the hill" 
                     Harry                      Lloyd 
"Harry went over the hill" "Lloyd went over the hill" 

The sapply function will return a vector – what do you think an lapply will return? mapply?
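A quick sketch of the contrast: sapply simplifies to a vector when it can, while lapply always returns a list with one element per input.

```r
sVal = sapply(1:3, function(x) x^2)
lVal = lapply(1:3, function(x) x^2)

sVal          # a numeric vector: 1 4 9
is.list(lVal) # TRUE – a list of three length-1 numerics
```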

Now that is a bit of a goofy example. Since R is vectorized, we would have gotten the same result just by pasting our testNames vector to the string. But, a more realistic situation is as follows:

# Reload starwars, since we reshaped it above
data(starwars)

starwarsShips = unique(unlist(starwars$starships))

shipRider = lapply(starwarsShips, function(x) {
  
  # fixed = TRUE treats the ship name as a literal string,
  # not as a regular expression
  riders = starwars$name[which(grepl(x, starwars$starships, fixed = TRUE))]
  
  res = data.frame(ship = x, 
                   riders = riders)
  
  return(res)
})